Automatic Language Identification in Code-Switched Hindi-English Social Media Text
Abstract
Natural Language Processing (NLP) tools typically struggle to process code-switched data, and so linguists are commonly forced to annotate such data manually. As this data becomes more readily available, automatic tools are increasingly needed to help speed up the annotation process and improve consistency. Last year, such a toolkit was developed to semi-automatically annotate transcribed bilingual code-switched Vietnamese-English speech data with token-based language information and POS tags (hereafter the CanVEC toolkit, L. Nguyen & Bryant, 2020). In this work, we extend this methodology to another language pair, Hindi-English, to explore the extent to which we can standardise the automation process. Specifically, we applied the principles behind the CanVEC toolkit to data from the International Conference on Natural Language Processing (ICON) 2016 shared task, which consists of social media posts (Facebook, Twitter and WhatsApp) that have been annotated with language and POS tags (Molina et al., 2016). We used the ICON-2016 annotations as the gold-standard labels in the language identification task. Ultimately, our tool achieved an F1 score of 87.99% on the ICON-2016 data. We then evaluated the first 500 tokens of each social media subset manually, and found that almost 40% of all errors were caused entirely by problems with the gold-standard, i.e., our system was correct. It is thus likely that the overall accuracy of our system is higher than reported. This shows great potential for effectively automating the annotation of code-switched corpora, on different language combinations, and in different genres. We finally discuss some limitations of our approach and release our code and human evaluation together with this paper.
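As an illustration of the evaluation described in the abstract, the following is a minimal sketch of scoring token-level language labels against gold-standard labels with per-label and macro-averaged F1. The abstract does not state which F1 averaging was used, so macro-averaging is an assumption here; the function names and the toy labels ("hi", "en", "univ") are likewise hypothetical and are not taken from the released code.

```python
from collections import Counter


def per_label_f1(gold, pred):
    """Return {label: (precision, recall, f1)} for two equal-length label sequences."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1          # correct prediction for label g
        else:
            fp[p] += 1          # p was predicted but is wrong
            fn[g] += 1          # g was the gold label but was missed
    scores = {}
    for label in set(gold) | set(pred):
        predicted = tp[label] + fp[label]
        actual = tp[label] + fn[label]
        precision = tp[label] / predicted if predicted else 0.0
        recall = tp[label] / actual if actual else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        scores[label] = (precision, recall, f1)
    return scores


def macro_f1(gold, pred):
    """Unweighted mean of per-label F1 (assumed averaging scheme)."""
    scores = per_label_f1(gold, pred)
    if not scores:
        return 0.0
    return sum(f1 for _, _, f1 in scores.values()) / len(scores)


# Toy, hypothetical data: one token label per position.
gold = ["hi", "hi", "en", "en", "univ", "hi"]
pred = ["hi", "en", "en", "en", "univ", "hi"]

if __name__ == "__main__":
    for label, (p, r, f) in sorted(per_label_f1(gold, pred).items()):
        print(f"{label}: P={p:.2f} R={r:.2f} F1={f:.2f}")
    print(f"macro-F1={macro_f1(gold, pred):.4f}")
```

A score computed this way treats every disagreement with the gold standard as a system error, which is why errors in the gold labels themselves would lower the reported figure.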
Similar resources
Shallow Parsing Pipeline - Hindi-English Code-Mixed Social Media Text
In this study, the problem of shallow parsing of Hindi-English code-mixed social media text (CSMT) has been addressed. We have annotated the data, developed a language identifier, a normalizer, a part-of-speech tagger and a shallow parser. To the best of our knowledge, we are the first to attempt shallow parsing on CSMT. The pipeline developed has been made available to the research community w...
POS Tagging of English-Hindi Code-Mixed Social Media Content
Code-mixing is frequently observed in user-generated content on social media, especially from multilingual users. The linguistic complexity of such content is compounded by the presence of spelling variations, transliteration and non-adherence to formal grammar. We describe our initial efforts to create a multi-level annotated corpus of Hindi-English code-mixed text collated from Facebook forums, an...
POS Tagging of Hindi-English Code Mixed Text from Social Media: Some Machine Learning Experiments
We discuss Part-of-Speech (POS) tagging of Hindi-English Code-Mixed (CM) text from social media content. We propose extensions to the existing approaches and also present a new feature set that addresses the transliteration problem inherent in social media. We achieve 84% accuracy with the new feature set. We show that the context and joint modeling of language detection and POS tag layers do...
Sentiment Identification in Code-Mixed Social Media Text
Sentiment analysis is the Natural Language Processing (NLP) task dealing with the detection and classification of sentiments in texts. While some tasks deal with identifying the presence of sentiment in text (subjectivity analysis), other tasks aim at determining the polarity of the text, categorizing it as positive, negative or neutral. Whenever there is presence of sentiment in text, it has a s...
Journal
Journal title: Journal of Open Humanities Data
Year: 2021
ISSN: 2059-481X
DOI: https://doi.org/10.5334/johd.44